Skip to content

Conversation

@viirya
Copy link
Member

@viirya viirya commented Jun 20, 2020

What changes were proposed in this pull request?

This patch applies NormalizeFloatingNumbers to distinct aggregate to fix a regression of distinct aggregate on NaNs.

Why are the changes needed?

We added NormalizeFloatingNumbers optimization rule in 3.0.0 to normalize special floating numbers (NaN and -0.0). But it is missing in distinct aggregate so causes a regression. We need to apply this rule on distinct aggregate to fix it.

Does this PR introduce any user-facing change?

Yes, fixing a regression of distinct aggregate on NaNs.

How was this patch tested?

Added unit test.

@SparkQA

This comment has been minimized.

@viirya
Copy link
Member Author

viirya commented Jun 20, 2020

cc @cloud-fan @dongjoon-hyun

val distinctAttributes = namedDistinctExpressions.map(_.toAttribute)
// Ideally this should be done in `NormalizeFloatingNumbers`, but we do it here because
// `groupingExpressions` is not extracted during logical phase.
val normalizednamedDistinctExpressions = namedDistinctExpressions.map { e =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
val normalizednamedDistinctExpressions = namedDistinctExpressions.map { e =>
val normalizedNamedDistinctExpressions = namedDistinctExpressions.map { e =>

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a basic catalyst question and feel free to send me away. The question is what about being "named" is a requirement in this case. I bet it has to do with expression binding, but I am not entirely sure, and was wondering if you had that answer since you had to special case it here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are questioning about why we need to have named expressions here, I think it is because we need these distinct expressions to be in the result expressions in Aggregate physical operator. These result expressions are for the output attributes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking the time @viirya. I am not 100% sure when all the cases that need named expression, but that the physical node output expressions need to be named, makes sense to me. Seems like any downstream node that needs to refer to an output needs things like ExprId in order to distinguish fields.

@abellina
Copy link
Contributor

@viirya thanks for looking at this issue.

@dongjoon-hyun
Copy link
Member

Thank you for pinging me, @viirya .

@dongjoon-hyun
Copy link
Member

cc @gatorsmile too since this is a correctness issue at 3.0.0. I believe we need to include this in 3.0.1.
cc @HyukjinKwon since he is interested in 3.0.1 release.

@viirya
Copy link
Member Author

viirya commented Jun 20, 2020

@abellina @dongjoon-hyun Thanks for comment.

@HyukjinKwon
Copy link
Member

Looks right to me

@SparkQA

This comment has been minimized.

Copy link
Contributor

@abellina abellina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@SparkQA

This comment has been minimized.

@SparkQA

This comment has been minimized.

@SparkQA

This comment has been minimized.

@SparkQA

This comment has been minimized.

@HyukjinKwon
Copy link
Member

retest this please

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you so much, @viirya .

@SparkQA
Copy link

SparkQA commented Jun 22, 2020

Test build #124354 has finished for PR 28876 at commit bc159ab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

dongjoon-hyun pushed a commit that referenced this pull request Jun 22, 2020
…nct aggregate

### What changes were proposed in this pull request?

This patch applies `NormalizeFloatingNumbers` to distinct aggregate to fix a regression of distinct aggregate on NaNs.

### Why are the changes needed?

We added `NormalizeFloatingNumbers` optimization rule in 3.0.0 to normalize special floating numbers (NaN and -0.0). But it is missing in distinct aggregate so causes a regression. We need to apply this rule on distinct aggregate to fix it.

### Does this PR introduce _any_ user-facing change?

Yes, fixing a regression of distinct aggregate on NaNs.

### How was this patch tested?

Added unit test.

Closes #28876 from viirya/SPARK-32038.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 2e4557f)
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun
Copy link
Member

Thank you all. Merged to master/3.0.

@maropu
Copy link
Member

maropu commented Jun 23, 2020

late LGTM, thanks, @viirya

revans2 pushed a commit to NVIDIA/spark-rapids that referenced this pull request Jun 24, 2020
SPARK-32038 reports a regression in Apache Spark (3.0.0), in
failing to normalize NaN/Zero float values, during DISTINCT
aggregations. This causes a mismatch in results between
Apache Spark 3.0.0 on CPU, and the Rapids Accelerator (which
returns the right results).
SPARK-32038 was fixed in apache/spark#28876.

This commit introduces a conditional xfail test that passes
on Apache Spark 3.0.1 and 3.1+ (which fixes SPARK-32038),
but produces an expected failure on Spark 3.0.0.
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
SPARK-32038 reports a regression in Apache Spark (3.0.0), in
failing to normalize NaN/Zero float values, during DISTINCT
aggregations. This causes a mismatch in results between
Apache Spark 3.0.0 on CPU, and the Rapids Accelerator (which
returns the right results).
SPARK-32038 was fixed in apache/spark#28876.

This commit introduces a conditional xfail test that passes
on Apache Spark 3.0.1 and 3.1+ (which fixes SPARK-32038),
but produces an expected failure on Spark 3.0.0.
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
SPARK-32038 reports a regression in Apache Spark (3.0.0), in
failing to normalize NaN/Zero float values, during DISTINCT
aggregations. This causes a mismatch in results between
Apache Spark 3.0.0 on CPU, and the Rapids Accelerator (which
returns the right results).
SPARK-32038 was fixed in apache/spark#28876.

This commit introduces a conditional xfail test that passes
on Apache Spark 3.0.1 and 3.1+ (which fixes SPARK-32038),
but produces an expected failure on Spark 3.0.0.
@viirya viirya deleted the SPARK-32038 branch December 27, 2023 18:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants